(54) An interactive speech recognition device

(57) The present invention relates to an interactive 
speech recognition device that recognises speech and 
produces sounds or actions in response to the recogni- 
tion result. 

The device includes a microphone (1), a speech
analysis area (2), a recognition area (3), a coefficient
setting means (4) and output means (6, 7, 8; 11-16). The
coefficient setting means (4) enables the accuracy of the
output to be improved. Additional features include a
temperature sensor, an air pressure sensor and calendar
means to improve the accuracy further and to enable the
output to be adaptive.

Description

The present invention relates to an interactive 
speech recognition device that recognises speech and 
produces sounds or actions in response to the recogni- 
tion result. 

One example of this kind of interactive speech rec- 
ognition device is a speech recognition toy. For exam- 
ple, in the speech recognition toy disclosed in Japanese 
patent application No. H62-253093, multiple instruc- 
tions that will be used as speech instructions are pre- 
registered as recognition-target phrases. The speech 
signal issued by the child who is using the toy is com- 
pared to the speech signals that have been registered, 
and when there is a match, the electrical signal pre- 
specified for the speech instruction is output and causes 
the toy to perform a specified action. 

However, in conventional toys of this type, such as
stuffed toy animals, that issue phrases or perform
specified actions based on the speech recognition result,
the recognition result is often different from the actual
word or phrase issued by the speaker; and even when the
recognition result is correct, the toys usually cannot
respond or return phrases that accommodate changes in
the prevailing conditions or environment.

Nowadays, sophisticated actions are required even 
of toys. For example, a child will quickly tire of a stuffed 
toy animal if it responds with "Good morning" when a 
child says "Good morning" to it regardless of the time of 
day. Furthermore, because this type of interactive 
speech recognition technology possesses the potential 
of being applied to game machines for older children or 
even to consumer appliances and instruments, development
of more advanced technologies has been desired.

Therefore, an object of the present invention is to 
provide an interactive speech recognition device that 
possesses a function for detecting changes in
circumstances or environment, e.g., time of day, that can
respond to the speech issued by the user by taking into
account the change in circumstances or environment, 
and that enables more sophisticated interactions. 

According to the present invention, there is provided
an interactive speech recognition device for recognising
and responding to input speech comprising:

a speech analysing means for analysing input
speech by comparing it to pre-registered speech
patterns and for creating a speech data pattern;
a speech recognition means for recognising the
input speech by analysing the speech data pattern
and deriving recognition data;
a speech output means for outputting a response
to said input speech using said recognition data;
and characterised by
a coefficient setting means for generating weighted
coefficients for each of the pre-registered speech
patterns and providing said speech recognition
means with said coefficients thereby enabling the
recognition data accuracy to be improved.


The interactive speech recognition device of the
present invention recognises input speech by analysing
and comparing it to pre-registered speech patterns and
responds to the recognised speech, and is characterised
in that it comprises: a speech analysis means for
creating a speech data pattern by analysing the input
speech; a variable data detection area for detecting
variable data that affects the interaction content; a
coefficient setting means into which the variable data
from said variable data detection area is input and that
generates a weighting coefficient for each pre-registered
recognition target speech according to said variable
data; a speech recognition means into which the speech
data pattern output by said speech analysis means is
input, that at the same time obtains a weighting
coefficient for each of the multiple pre-registered
recognition target speeches from said coefficient setting
means, that computes the final recognition data by taking
into account the weighting coefficient in effect for each
speech at that time, that recognises said input speech
based on the computed final recognition data, and that
outputs the final recognition data of the recognised
speech; a speech synthesis means for outputting
synthesised speech data based on the final recognition
data computed by said speech recognition means by
considering said coefficient; and a speech output means
for outputting the output of said speech synthesis means
to the outside.

Said variable data detection means is, for example,
a timing means for detecting time data, and said
coefficient setting means generates a weighting
coefficient that corresponds to the time of day for each
of the pre-registered recognition target speeches. In
this case, the coefficient setting means can be
configured to output the largest weighting coefficient
for the recognition data of a phrase at the time of day
(peak time) at which that phrase was correctly recognised
most frequently in the past, and a smaller weighting
coefficient as the time deviates from this peak time.

Another embodiment of the interactive speech
recognition device of the invention recognises input
speech by analysing and comparing it to pre-registered
speech patterns and responds to the recognised speech,
and is characterised in that it comprises: a speech
analysis means for generating a speech data pattern by
analysing the input speech; a speech recognition means
for outputting the recognition data that corresponds to
said input speech based on the speech data pattern output
by said speech analysis means; a timing means for
generating time data; a response content level
generation means into which the time data from said
timing means and at least one of the recognition count
data correctly recognised by said speech recognition
means are input, and that, based on the input data,
generates a response content level for changing the
response content for the input speech; a response content
level storage means for storing the response level that
corresponds to the time obtained by said response content
level generation means; a response content creation
means that determines the response content appropriate
for the response level generated by said response content
level generation means, based on the recognition data
from said speech recognition means, and that outputs the
corresponding response content data; a speech synthesis
means for outputting synthesised speech data that
corresponds to the response content data, based on the
response content data from said response content creation
means; and a speech output means for outputting the
output of said speech synthesis means to the outside.

Still another embodiment of the interactive speech
recognition device of the present invention recognises
input speech by analysing and comparing it to
pre-registered speech patterns and responds to the
recognised speech, and is characterised in that it
comprises: a speech analysis means for generating a
speech data pattern by analysing the input speech; a
speech recognition means for outputting the recognition
data that corresponds to said input speech based on the
speech data pattern output by said speech analysis means;
a variable data detection area for detecting variable
data that affects the interaction content; a response
content creation means into which the variable data from
said variable data detection area and the recognition
data from said speech recognition means are input, and
that, based on said recognition data, outputs the
response content data by taking said variable data into
consideration; a speech synthesis means for outputting
synthesised speech data in response to the response
content data output by said response content creation
means; and a speech output means for outputting the
output of said speech synthesis means to the outside.

Said variable data detection means is a tempera- 
ture sensor that measures the temperature of the usage 
environment and outputs the temperature data, and said 
response content creation means outputs the response 
content data by taking said temperature data into con- 
sideration. 

Alternatively, said variable data detection means is 
an air pressure sensor that measures the
air pressure of the usage environment and outputs the 
air pressure data, and said response content creation 
means outputs the response content data by taking said 
air pressure data into consideration. 

Alternatively, said variable data detection means is 
a calendar detection means that detects calendar data 
and outputs the calendar data, and said response
content creation means outputs the response content data
by taking said calendar data into consideration. 

The invention assigns a weighting coefficient to the
recognition data of each of the pre-registered
recognition target speeches, based on the changes in the
variable data (e.g., time of day, temperature, weather,
and date) that affects the content of the interaction. If
time of day is used as the variable data, for example, a
weighting coefficient can be assigned to the recognition
data of each recognition target speech according to the
time of day, and speech recognition that considers the
weighting coefficients can be performed by taking into
consideration whether or not the phrase (in particular a
greeting phrase) issued by the speaker is appropriate
for the time of day. Therefore, even if the speech
analysis result shows that multiple recognition target
speeches exist that possess a similar speech pattern,
weighting coefficients can increase the differences
among the numerical values of the recognition data that
are ultimately output, thus improving the recognition
rate. The same is also true for the other types of
variable data mentioned above, in addition to time of
day. For example, if weighting coefficients that
correspond to the current temperature are set up, whether
or not the greeting phrase issued by the speaker is
appropriate relative to the current temperature can be
determined. Here again, even if the speech analysis
result shows that multiple recognition target speeches
exist that possess a similar speech pattern, weighting
coefficients can increase the differences among the
numerical values of the recognition data that are
ultimately output, thus improving the recognition rate.

Furthermore, when time of day is used as the vari- 
able data, the relationship between phrases and times 
of day that matches actual usage can be obtained by 
detecting the time of day at which a particular phrase is
used most often and assigning a large weighting coef- 
ficient to this peak time, and smaller weighting coeffi- 
cients to times of day that deviate farther from this peak 
time. 
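
As a rough illustration of the peak-time weighting described
above, the following Python sketch returns the largest
coefficient at a phrase's peak hour and smaller coefficients
as the hour moves away from it. The linear falloff, the
lower bound of 0.5, the 24-hour wrap-around and the name
time_weight are assumptions made for illustration only; the
text does not specify the shape of the curve.

```python
def time_weight(hour, peak_hour, max_w=1.0, min_w=0.5):
    """Weighting coefficient that is largest at peak_hour.

    The linear falloff over 12 hours is an assumed shape; the
    description only requires smaller values away from the peak.
    """
    diff = abs(hour - peak_hour) % 24
    distance = min(diff, 24 - diff)  # circular distance, 0..12 hours
    return max_w - (max_w - min_w) * distance / 12.0


# Example: a phrase whose peak time is 7.00 a.m.
weight_at_peak = time_weight(7, 7)    # largest coefficient
weight_at_night = time_weight(19, 7)  # farthest from the peak
```

With these assumed bounds, the coefficient ranges from 1.0
at the peak hour down to 0.5 twelve hours away.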

Additionally, the response content level can be
changed in response to the speaker's phrase by
generating the response content level for changing the
response content for the input speech as time passes, and
by issuing an appropriate response by determining the
response content that matches said response level
based on the recognition data from the speech
recognition area.

Furthermore, by using data from instruments such 
as a temperature sensor or air pressure sensor, or
variable data such as calendar data, and creating the
response content based on these data, the response
content can be varied widely, enabling more meaningful
interactions.

Embodiments of the present invention will now be
described with reference to the accompanying drawings,
of which:

Figure 1 is a block diagram showing the overall
configuration of the stuffed toy dog of Working example
1 of the present invention;

Figure 2 is a block diagram showing the overall
configuration of Working example 2 of the present
invention;

Figure 3 is a block diagram showing the overall
configuration of Working example 3 of the present
invention;

Figure 4 is a block diagram showing the overall
configuration of Working example 4 of the invention;

Figure 5 is a block diagram showing the overall
configuration of Working example 5 of the present
invention; and

Figure 6 is a block diagram showing the overall
configuration of Working example 6 of the invention.

The invention is explained in detail below using 
working examples. Note that the invention has been ap- 
plied to a toy in these working examples, and more par- 
ticularly to a stuffed toy dog intended for small children. 

In working example 1, weighting coefficients are set
up for the recognition data of pre-registered recognition
target speeches according to the value of the variable
data (e.g., time of day, temperature, weather and date)
that affects the interaction content, in order to improve
the recognition rate when a greeting phrase is input.
Figure 1 is a configuration diagram that explains Working
example 1 of the present invention. The configuration
will be briefly explained first, and the individual
functions will be explained in detail later in the
document. Note that working example 1 uses time of day as
said variable data that affects the content of the
interaction.

In Figure 1, the interior of stuffed toy dog 30 is
provided with a microphone 1 for entering speech from
outside. A speech analysis area 2 is provided for
analysing the speech that is input from said microphone 1
and for generating a speech pattern that matches the
characteristic volume of the input speech. There is a
clock area 5, which is a timing means for outputting
timing data such as the time at which said speech is
input and the time at which this speech input is
recognised by the speech recognition area described
below. There is a coefficient setting area 4 into which
the time data from said clock area 5 is input and that
generates weighting coefficients that change over time,
in correspondence with the content of each recognition
target speech. There is a speech recognition area 3 into
which the speech data pattern of the input speech output
by said speech analysis area 2 is input, that at the same
time obtains from said coefficient setting area 4 the
weighting coefficient in effect at that time for each
registered recognition target speech, that computes the
final recognition data by multiplying the recognition
data corresponding to each recognition target speech by
its corresponding weighting coefficient, that recognises
said input speech based on the computed final recognition
data, and that outputs the final recognition data of the
recognised speech. There is a speech synthesis area 6 for
outputting the speech synthesis data that corresponds to
the final recognition data recognised by said speech
recognition area 3 by taking said coefficient into
consideration. There is a drive control area 7 for
driving a motion mechanism 10, which moves the mouth,
etc. of the stuffed toy 30 according to the drive
conditions that are predetermined in correspondence to
the recognition data recognised by said speech
recognition area 3. There is a speaker 8 for outputting
the content of the speech synthesised by said speech
synthesis area 6 to the outside. Finally, there is a
power supply area 9 for driving all of the above areas.

Said speech recognition area 3 in the example uses
a neural network that handles a non-specific speaker
as its recognition means. However, the recognition
means is not limited to the method that handles a
non-specific speaker, and other known methods such as a
method that handles a specific speaker, DP matching,
and HMM can be used as the recognition means.

In said motion mechanism 10, a motor 11 rotates
based on the drive signal (which matches the length of
the output signal from speech synthesis area 6) output
by drive control area 7, and when cam 12 rotates in
conjunction with motor 11, a protrusion-shaped rib 13
provided on cam 12 moves in a circular trace in
conjunction with the rotation of cam 12. Crank 15, which
uses axis 14 as a fulcrum, is clipped on rib 13, and moves
a lower jaw 16 of the stuffed toy dog up and down
synchronously with the rotation of the cam 12.

In this configuration, the speech that is input from 
the microphone 1 is analysed by speech analysis area 
2, and a speech data pattern matching the characteristic
volume of the input speech is created. This speech data 
pattern is input into the input area of the neural network 
provided in speech recognition area 3, and is recog- 
nised as explained below. 

The explanation below is based on an example in
which several greeting words or phrases are recognised.
For example, greeting phrases such as "Good morning,"
"I'm leaving," "Good day," "I'm home," and "Good night"
are used here for explanation.

Suppose that a phrase "Good morning" issued by
a non-specific speaker is input into microphone 1. The
characteristics of this speaker's "Good morning" are
analysed by speech analysis area 2 and are input into
speech recognition area 3 as a speech data pattern.

At the same time that the phrase "Good morning"
input from microphone 1 is detected as sound pressure,
the data related to the time at which the phrase "Good
morning" is recognised by the neural network of speech
recognition area 3 is supplied from clock area 5 to
coefficient setting area 4. Note that the time to be
referenced by coefficient setting area 4 is the time the
speech was recognised by speech recognition area 3 in
this case.

Said speech data pattern of "Good morning" that
was input into the neural network of speech recognition
area 3 in this way is output from the output area of the
neural network as recognition data possessing a value,
instead of binary data. Here, an example in which this
value is a floating-point number between 0 and 10 is
used for explanation.

When the speaker says "Good morning" to the
stuffed toy 30, the neural network of speech recognition
area 3 outputs a recognition data value of 8.0 for "Good
morning," 1.0 for "I'm leaving," 2.0 for "Good day," 1.0
for "I'm home," and 4.0 for "Good night." The fact that
the recognition data from the neural network for the
speaker's "Good morning" is a high value of 8.0 is
understandable. The reason why the recognition data
value for "Good night" is relatively high compared to
those for "I'm leaving," "Good day," and "I'm home" is
presumed to be because the speech pattern data of "Good
morning" and "Good night" of a non-specific speaker,
analysed by speech analysis area 2, are somewhat similar
to each other. Therefore, although the probability is
nearly non-existent that the speaker's "Good morning"
will be recognised as "I'm leaving," "Good day," or "I'm
home," the probability is high that the speaker's "Good
morning" will be recognised as "Good night."

During this process, speech recognition area 3 
fetches the weighting coefficient preassigned to a rec- 
ognition target speech by referencing coefficient setting 
area 4, and multiplies the recognition data by this coef- 
ficient. Because different greeting phrases are used de- 
pending on the time of day, weighting coefficients are 
assigned to various greeting phrases based on the time 
of day. For example, if the current time is 7.00 a.m., 1.0
will be used as the weighting coefficient for "Good
morning," 0.9 for "I'm leaving," 0.7 for "Good day," 0.6 for
"I'm home," and 0.5 for "Good night," and these relationships
among recognition target speeches, time of day, and co- 
efficients are stored in coefficient setting area 4 in ad- 
vance. 

When weighting coefficients are used in this way,
the final recognition data of "Good morning" will be 8.0
(i.e., 8.0 × 1.0) since the recognition data for "Good
morning" output by the neural network is 8.0 and the
coefficient for "Good morning" at 7.00 a.m. is 1.0.
Likewise, the final recognition data for "I'm leaving"
will be 0.9 (i.e., 1.0 × 0.9), the final recognition data
for "Good day" will be 1.4 (i.e., 2.0 × 0.7), the final
recognition data for "I'm home" will be 0.6 (i.e., 1.0 ×
0.6), and the final recognition data for "Good night" will
be 2.0 (i.e., 4.0 × 0.5). In this way, speech recognition
area 3 creates final recognition data by taking
time-dependent weighting coefficients into consideration.
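
The arithmetic above can be reproduced with a short Python
sketch. The recognition data values and the 7.00 a.m.
coefficients are those given in the text; the dictionary
layout and the name final_recognition_data are assumptions
made for illustration and are not taken from the device
itself.

```python
# Recognition data output by the neural network for a spoken "Good morning".
SCORES = {
    "Good morning": 8.0,
    "I'm leaving": 1.0,
    "Good day": 2.0,
    "I'm home": 1.0,
    "Good night": 4.0,
}

# Weighting coefficients in effect at 7.00 a.m.
COEFF_7AM = {
    "Good morning": 1.0,
    "I'm leaving": 0.9,
    "Good day": 0.7,
    "I'm home": 0.6,
    "Good night": 0.5,
}


def final_recognition_data(scores, coefficients):
    """Multiply each recognition data value by its weighting coefficient."""
    return {
        phrase: round(score * coefficients[phrase], 2)
        for phrase, score in scores.items()
    }


final = final_recognition_data(SCORES, COEFF_7AM)
best = max(final, key=final.get)  # phrase with the largest final value
```

Here the largest final value belongs to "Good morning" (8.0),
well clear of "Good night" (2.0).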

When the final recognition data are determined by
taking time-dependent weighting coefficients into
consideration in this way, the final recognition data for
"Good morning" is four times larger than that for "Good
night." As a result, speech recognition area 3 can
accurately recognise the phrase "Good morning" when it is
issued by the speaker. Note that the number of phrases
that can be recognised can be set to any value.

The final recognition data of the phrase "Good
morning" determined in this way is input into speech
synthesis area 6 and drive control area 7. Speech
synthesis area 6 converts the final recognition data from
speech recognition area 3 to pre-determined speech
synthesis data, and outputs that speech synthesis output
from speaker 8. For example, "Good morning" will
be output from speaker 8 in response to the final
recognition data of the phrase "Good morning" in this
case. That is, when the child playing with the stuffed
toy says "Good morning" to the toy, the toy responds with
"Good morning." This is because the phrase issued and the
time of day match each other since the child says "Good
morning" at 7.00 a.m. As a result, "Good morning" is
correctly recognised and an appropriate response is
returned.

At the same time, drive control area 7 drives indi- 
vidual action mechanisms according to the drive condi- 
tions pre-determined for said final recognition data. 
Here, the mouth of the stuffed toy dog 30 is moved syn- 
chronously with the output signal ("Good morning" in 
this case) from speech synthesis area 6. Naturally, in 
addition to moving the mouth of the stuffed toy, it is pos- 
sible to move any other areas, such as shaking the head 
or tail, for example. 

Next, a case in which the current time is 8.00 p.m. 
is explained. In this case, 0.5 is set as the weighting
coefficient for "Good morning," 0.6 for "I'm leaving," 0.7
for "Good day," 0.9 for "I'm home," and 1.0 for "Good
night."

When weighting coefficients are used in this way,
the final recognition data of "Good morning" will be 4.0
(i.e., 8.0 × 0.5) since the recognition data for "Good
morning" output by the neural network is 8.0 and the
weighting coefficient for "Good morning" at 8.00 p.m. is
0.5. Likewise, the final recognition data for "I'm leaving"
will be 0.6 (i.e., 1.0 × 0.6), the final recognition data
for "Good day" will be 1.4 (i.e., 2.0 × 0.7), the final
recognition data for "I'm home" will be 0.9 (i.e., 1.0 ×
0.9), and the final recognition data for "Good night" will
be 4.0 (i.e., 4.0 × 1.0).

In this way, speech recognition area 3 creates final 
recognition data by taking weighting coefficients into 
consideration. Since the final recognition data for both 
"Good morning" and "Good night" are 4.0, the two 
phrases cannot be differentiated. In other words, when 
the speaker says "Good morning" at 8.00 p.m., it is not 
possible to determine whether the phrase is "Good 
morning" or "Good night." 

This final recognition data is supplied to speech
synthesis area 6 and drive control area 7, both of which
act accordingly. That is, speech synthesis area 6
converts the final recognition data to pre-determined
ambiguous speech synthesis data and outputs it. For
example, "Something is funny here!" is output from
speaker 8, indicating that "Good morning" is not
appropriate for use at night time.

At the same time, drive control area 7 drives
individual action mechanisms according to the drive
conditions pre-determined for said final recognition
data. Here, the mouth of the stuffed toy dog is moved
synchronously with the output signal ("Something is funny
here!" in this case) from speech synthesis area 6.
Naturally, in addition to moving the mouth of the stuffed
toy, it is possible to move any other areas, as in the
case above.
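
The 8.00 p.m. case above, in which "Good morning" and "Good
night" end up with the same final recognition data, can be
sketched in Python as follows. The recognition data values,
coefficients and the ambiguous response phrase are those
given in the text; the function name recognise and its
margin parameter are assumptions for illustration, since the
text only states that equal values cannot be differentiated.

```python
def recognise(final_data, margin=0.0):
    """Return the best phrase, or None when the two largest values are
    not separated by more than `margin` (an assumed tie rule)."""
    ranked = sorted(final_data.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_value), (_, second_value) = ranked[0], ranked[1]
    if best_value - second_value <= margin:
        return None
    return best


# Recognition data for a spoken "Good morning" (values from the text).
SCORES = {"Good morning": 8.0, "I'm leaving": 1.0, "Good day": 2.0,
          "I'm home": 1.0, "Good night": 4.0}
# Weighting coefficients in effect at 8.00 p.m. (values from the text).
COEFF_8PM = {"Good morning": 0.5, "I'm leaving": 0.6, "Good day": 0.7,
             "I'm home": 0.9, "Good night": 1.0}

final_8pm = {phrase: SCORES[phrase] * COEFF_8PM[phrase] for phrase in SCORES}
result = recognise(final_8pm)  # both top values are 4.0, so no winner
response = "Something is funny here!" if result is None else result
```

A non-zero margin would also treat near-ties as ambiguous;
whether the device does so is not stated in the text.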

Next, a case in which the speaker says "Good night"
when the current time is 8.00 p.m. is explained. In this
case, it is assumed that the neural network of speech
recognition area 3 outputs a recognition data value of
4.0 for "Good morning," 1.0 for "I'm leaving," 2.0 for
"Good day," 1.0 for "I'm home," and 8.0 for "Good night."
When the current time is 8.00 p.m., 0.5 will be used as
the weighting coefficient for "Good morning," 0.6 for "I'm
leaving," 0.7 for "Good day," 0.9 for "I'm home," and 1.0
for "Good night." When weighting coefficients are used
in this way, the final recognition data of "Good morning"
will be 2.0 (i.e., 4.0 × 0.5) since the recognition data for
"Good morning" output by the neural network is 4.0 and
the weighting coefficient for "Good morning" at 8.00 p.m.
is 0.5. Likewise, the final recognition data for "I'm
leaving" will be 0.6 (i.e., 1.0 × 0.6), the final
recognition data for "Good day" will be 1.4 (i.e., 2.0 ×
0.7), the final recognition data for "I'm home" will be
0.9 (i.e., 1.0 × 0.9), and the final recognition data for
"Good night" will be 8.0 (i.e., 8.0 × 1.0). In this way,
speech recognition area 3 creates final recognition data
by taking weighting coefficients into consideration.

When the final recognition data is determined by 
taking time-related information into consideration in this 
way, the final recognition data for "Good night" is four 
times larger than that for "Good morning." As a result,
speech recognition area 3 can accurately recognise the 
phrase "Good night" when it is issued by the speaker. 

The final recognition data of the phrase "Good
night" determined in this way is input into speech
synthesis area 6 and drive control area 7. Speech
synthesis area 6 converts the final recognition data from
speech recognition area 3 to pre-determined speech
synthesis data, and outputs that speech synthesis output
from speaker 8. For example, "Good night" will be
output from speaker 8 in response to the final
recognition data of the phrase "Good night" in this case.

Although the response from the stuffed toy 30 is 
"Good morning" or "Good night" in response to the 
speaker's "Good morning" or "Good night", respectively, 
in the above explanation, it is possible to set many kinds 
of phrases as the response. For example, "You're up 
early today" can be used in response to "Good morning."

Furthermore, although the time of day was used as
the variable data for setting weighting coefficients in
Working example 1, it is also possible to set weighting
coefficients based on other data such as temperature,
weather, and date. For example, if temperature is used
as the variable data, temperature data is detected from
a temperature sensor that measures the air temperature,
and weighting coefficients are assigned to the
recognition data for weather-related greeting phrases
(e.g., "It's hot, isn't it?" or "It's cold, isn't it?")
that are input and to other registered recognition data.
In this way, the difference in the values of the two
recognition data is magnified by their weighting
coefficients even if a speech data pattern that is
similar to the input speech exists, thus increasing the
recognition rate. Furthermore, if a combination of
variable data such as time of day, temperature, weather,
and date is used and weighting coefficients are assigned
to these variable data, the recognition rate for various
greeting phrases can be increased even further.
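
One possible way of combining weighting coefficients from
several kinds of variable data is sketched below in Python.
The text does not say how the coefficients are combined, so
the multiplication rule, the 25 degree threshold, the
coefficient values and the names temperature_weights and
combine are all assumptions made purely for illustration.

```python
HOT = "It's hot, isn't it?"
COLD = "It's cold, isn't it?"


def temperature_weights(temp_c):
    """Assumed coefficients for the two weather-related greetings."""
    if temp_c >= 25.0:  # hypothetical threshold for a hot day
        return {HOT: 1.0, COLD: 0.5}
    return {HOT: 0.5, COLD: 1.0}


def combine(*tables):
    """Multiply coefficients phrase by phrase (assumed combination rule).

    A phrase that is missing from a table is treated as having 1.0 there.
    """
    phrases = set().union(*tables)
    combined = {}
    for phrase in phrases:
        weight = 1.0
        for table in tables:
            weight *= table.get(phrase, 1.0)
        combined[phrase] = weight
    return combined


# Example: temperature coefficients combined with a hypothetical
# time-of-day table that only covers one of the phrases.
weights = combine(temperature_weights(30.0), {HOT: 0.8})
```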

(Working example 2)

Next, Working example 2 of the present invention
will be explained with reference to Figure 2. Note that
the stuffed toy dog 30, motion mechanism 10 for moving
the mouth of the stuffed toy, etc. are omitted from
Figure 2. Figure 2 is different from Figure 1 in that a
memory area 21 is provided for storing the weighting
coefficients for recognisable phrases that are set by
coefficient setting area 4 according to time data. Since
all other configuration elements are identical to those
in Figure 1, like symbols are used to represent like
parts. The processing between memory area 21 and
coefficient setting area 4 will be explained later.

In Figure 2, the speech that is input from micro- 
phone 1 is analysed by speech analysis area 2, and a 
speech data pattern matching the characteristic volume 
of the input speech is created. This speech data pattern 
is input into the input area of the neural network provided
in speech recognition area 3, and is recognised as ex- 
plained below. 

The explanation below is based on an example in
which several greeting words or phrases are recognised.
For example, greeting phrases such as "Good morning,"
"I'm leaving," "Good day," "I'm home," and "Good night"
are used here for explanation.

Suppose that a phrase "Good morning" issued by
a non-specific speaker is input into microphone 1. The
characteristics of this speaker's "Good morning" are
analysed by speech analysis area 2 and are input into
speech recognition area 3 as a speech data pattern.

At the same time that the phrase "Good morning"
input from microphone 1 is detected as sound pressure,
the data related to the time at which the phrase "Good
morning" is recognised by the neural network of speech
recognition area 3 is supplied from clock area 5 to
coefficient setting area 4. Note that the time to be
referenced by coefficient setting area 4 is the time the
speech was recognised by speech recognition area 3 in
this case.

Said speech data pattern of "Good morning" that
was input into the neural network of speech recognition
area 3 in this way is output from the output area of the
neural network as recognition data possessing a value
instead of binary data. Here, an example in which this
value is a floating-point number between 0 and 10 is
used for explanation.

When the speaker says "Good morning" to the
stuffed toy 30, the neural network of speech recognition
area 3 outputs a recognition data value of 8.0 for "Good
morning," 1.0 for "I'm leaving," 2.0 for "Good day," 1.0
for "I'm home," and 4.0 for "Good night." The fact that
the recognition data from the neural network for the
speaker's "Good morning" is a high value of 8.0 is
understandable. The reason why the recognition data
value for "Good night" is relatively high compared to
those for "I'm leaving," "Good day," and "I'm home" is
presumed to be because the speech pattern data of "Good
morning" and "Good night" of a non-specific speaker,
analysed by speech analysis area 2, are somewhat similar
to each other. Therefore, although the probability is
nearly non-existent that the speaker's "Good morning"
will be recognised as "I'm leaving," "Good day," or "I'm
home," the probability is high that the speaker's "Good
morning" will be recognised as "Good night." Up to this
point, Working example 2 is nearly identical to Working
example 1.

Speech recognition area 3 fetches the weighting
coefficient assigned to a recognition target speech according
to time data by referencing coefficient setting area
4. However, in Working example 2, memory area 21 is
connected to coefficient setting area 4, and the content
(weighting coefficients) stored in memory area 21 is
referenced by coefficient setting area 4. Note that
coefficient setting area 4 outputs a large weighting coefficient
to the recognition data of a phrase if the phrase occurs
at the time of day it was most frequently recognised, and
outputs a smaller weighting coefficient to the recognition
data of the phrase as the phrase occurs away from said
time of day. In other words, the largest weighting
coefficient is assigned to the recognition data when the
phrase occurs at the time of day with the highest usage
frequency, and a smaller weighting coefficient is
assigned to the recognition data as the phrase occurs
further away from said time of day.

For example, assume that the current time is 7.00 a.m.
and that the initial weighting coefficients stored in
memory area 21 are 1.0 for "Good morning," 0.9 for
"I'm leaving," 0.7 for "Good day," 0.6 for "I'm home,"
and 0.5 for "Good night." The final recognition data of
"Good morning" will then be 8.0 (i.e., 8.0 × 1.0), since
the recognition data for "Good morning" output by the
neural network is 8.0 and the coefficient for "Good
morning" fetched from memory area 21 at 7.00 a.m. is 1.0.
Likewise, the final recognition data will be 0.9 for
"I'm leaving," 1.4 for "Good day," 0.6 for "I'm home,"
and 2.0 for "Good night." These final recognition data
are initially created by speech recognition area 3.
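
The arithmetic above can be reproduced with a short sketch.
The Python below is illustrative only: the names
recognition_data, coefficients and apply_weights are
assumptions introduced here and do not appear in the device
described; the numbers are those of the 7.00 a.m. example.

```python
# Neural-network outputs for the spoken "Good morning" (values from the text).
recognition_data = {
    "Good morning": 8.0,
    "I'm leaving": 1.0,
    "Good day": 2.0,
    "I'm home": 1.0,
    "Good night": 4.0,
}

# Weighting coefficients taken to be held in memory area 21 at 7.00 a.m.
coefficients = {
    "Good morning": 1.0,
    "I'm leaving": 0.9,
    "Good day": 0.7,
    "I'm home": 0.6,
    "Good night": 0.5,
}


def apply_weights(data, weights):
    """Multiply each phrase's recognition value by its coefficient."""
    return {
        phrase: round(value * weights[phrase], 2)
        for phrase, value in data.items()
    }


final_data = apply_weights(recognition_data, coefficients)

# The phrase with the largest final recognition data is the candidate result.
best_phrase = max(final_data, key=final_data.get)
```

With these values the weighted score of "Good night" falls
from 4.0 to 2.0 while "Good morning" stays at 8.0, which is
the widening of the margin that the text relies on.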

Even when recognition is performed by taking into
consideration the weighting coefficient based on the
time of day, there is some range of time in which a
certain phrase will be correctly recognised. For example,
the phrase "Good morning" may be correctly recognised
at 7.00 a.m., 7.30 a.m. or 8.00 a.m. By taking this factor
into consideration, memory area 21 stores the largest
weighting coefficient for a phrase when it occurs at the
time of day with the highest usage frequency based on
the time data for recognising that phrase in the past, and
stores a smaller weighting coefficient for the phrase as
it occurs away from said time of day.

For example, if the phrase "Good morning" was
most frequently recognised at 7.00 a.m. according to the
past statistics, the coefficient to be applied to the
recognition data of "Good morning" is set the largest when the
time data indicates 7.00 a.m., and smaller as the time
data deviates farther away from 7.00 a.m. That is, the
coefficient is set at 1.0 for 7.00 a.m., 0.9 for 8.00 a.m.,
and 0.8 for 9.00 a.m., for example. The time data used
for setting coefficients is statistically created based on
several past time data instead of just one time data. Note
that the coefficients during the initial setting are set to
standard values for pre-determined times of day. That
is, in the initial state, the weighting coefficient for "Good
morning" at 7.00 a.m. is set to 1.0.
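
One way to obtain coefficients of 1.0, 0.9 and 0.8 at 7.00,
8.00 and 9.00 a.m. is a linear fall-off with distance from
the peak time. The sketch below assumes exactly that; the
function name time_coefficient, the step of 0.1 per hour,
the floor value and the wrap-around at midnight are all
assumptions, since the text gives only the three sample
values and not the shape of the curve.

```python
def time_coefficient(hour, peak_hour, step=0.1, floor=0.0):
    """Return 1.0 at peak_hour, falling by `step` per hour of distance.

    The linear shape, `step` and `floor` are illustrative assumptions.
    """
    # Measure distance on a 24-hour clock, so 23.00 and 1.00 are 2 hours apart.
    diff = abs(hour - peak_hour) % 24
    distance = min(diff, 24 - diff)
    return round(max(floor, 1.0 - step * distance), 2)
```

For a peak at 7.00 a.m. this yields 1.0, 0.9 and 0.8 for
hours 7, 8 and 9, matching the figures quoted above.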

The coefficient of the "Good morning" that is most
recently recognised is input into memory area 21 as
new coefficient data along with the time data, and
memory area 21 updates the coefficient for the phrase based
on this data and past data as needed.
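
The text does not say how memory area 21 combines new and
past data. As a minimal sketch under that caveat, the class
below simply counts the hours at which a phrase was
recognised and reports the most frequent one as the peak
time. The class name PhraseTimeStatistics, its methods and
the tie-breaking rule are hypothetical.

```python
from collections import Counter


class PhraseTimeStatistics:
    """Keeps a count of the hours at which one phrase was recognised."""

    def __init__(self, initial_peak_hour):
        # Standard value used until real recognitions have been recorded.
        self.initial_peak_hour = initial_peak_hour
        self.hour_counts = Counter()

    def record(self, hour):
        """Store the hour of the most recent correct recognition."""
        self.hour_counts[hour] += 1

    def peak_hour(self):
        """Return the hour with the highest usage frequency so far."""
        if not self.hour_counts:
            return self.initial_peak_hour
        # Ties go to the earliest hour so that the result is deterministic.
        return min(self.hour_counts, key=lambda h: (-self.hour_counts[h], h))
```

A speaker who usually says "Good morning" at 8.00 a.m.
would, after a few recordings, shift the peak from the
initial 7 to 8.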

Because the coefficient for a phrase is made the largest at
the time of day when it is used most frequently, when
the phrase "Good morning" is issued at around 7.00 a.m.,
the final recognition data of "Good morning" will be 8.0
(i.e., 8.0 × 1.0), since the recognition data for "Good
morning" output by the neural network is 8.0 and the
coefficient for "Good morning" fetched from memory
area 21 at 7.00 a.m. is 1.0. Since this final recognition data
is at least four times larger than those of other phrases,
the phrase "Good morning" is correctly recognised by
speech recognition area 3.

The final recognition data of the phrase "Good
morning" determined in this way is input into speech
synthesis area 6 and drive control area 7. Speech
synthesis area 6 converts the final recognition data from
speech recognition area 3 to pre-determined speech
synthesis data, and a pre-set phrase such as "Good
morning" or "You're up early today" is returned through
speaker 8 embedded in the body of the stuffed toy dog,
as a response to the speaker's "Good morning."

On the other hand, if "Good morning" is issued at
around 12 noon, the coefficient for "Good morning"
becomes small, making the final recognition data for "Good
morning" small, and "Good morning" will not be
recognised. In such a case, speech synthesis area 6 is
programmed to issue a corresponding phrase as in Working
example 1, and a response such as "Something is funny
here!" is issued by stuffed toy 30.
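
The fallback behaviour can be sketched as follows. How the
device decides that a phrase has not been recognised is not
stated in this passage, so the threshold test below is an
assumption, as are the function name choose_response and
its parameters.

```python
def choose_response(final_data, responses, threshold,
                    fallback="Something is funny here!"):
    """Pick the reply for the best-scoring phrase, or a fallback reply.

    `threshold` is a hypothetical cut-off; the text gives no value for it.
    """
    best_phrase = max(final_data, key=final_data.get)
    if final_data[best_phrase] < threshold:
        # No phrase scored high enough to count as recognised.
        return fallback
    return responses[best_phrase]
```

The design choice here is to reject on the weighted score
rather than on the raw network output, so that a phrase
spoken at an unlikely time of day is more easily rejected.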






EP 0 730 261 A2 






(Working example 3) 



Next, Working example 3 of the present invention
will be explained with reference to Figure 3. Note that
the stuffed toy dog 30, the motion mechanism 10 for
moving the mouth of the stuffed toy, etc. shown in Figure
1 are omitted from Figure 3. Working example 3 is
provided with microphone 1 for entering speech from
outside; speech analysis area 2 for analysing the speech
that is input from said microphone 1 and for generating
a speech pattern that matches the characteristic volume
of the input speech; clock area 5 for outputting timing
data; speech recognition area 3 for outputting the
recognition data for said input speech based on the speech
data pattern output by said speech analysis area 2;
speech synthesis area 6 for outputting the speech
synthesis data that corresponds to the final recognition data
recognised by taking said coefficient from said speech
recognition area 3 into consideration; drive control area
7 for driving motion mechanism 10 (see Figure 1),
which moves the mouth, etc. of the stuffed toy 30
according to the drive conditions that are predetermined in
correspondence to the recognition data recognised by
said speech recognition area 3; speaker 8 for outputting
the content of the speech synthesised by said speech
synthesis area 6 to the outside; and power supply area
9 for driving all of the above areas; and is additionally
provided with response content level generation area
31, response content level storage area 32, and
response content creation area 33.

Said speech recognition area 3 in the example uses 
a neural network that handles a non-specific speaker 
as its recognition means. However, the recognition 
means is not limited to the method that handles a
non-specific speaker, and other known methods, such as a
method that handles a specific speaker, DP matching,
and HMM, can be used as the recognition means.

Said response content level generation area 31 
generates response level values for increasing the level 
of response content as time passes or as the number of 
recognitions by speech recognition area 3 increases. 
Response content level storage area 32 stores the re- 
lationship between the response level generated by re- 
sponse content level generation area 31 and time. That 
is, the relationship between the passage of time and lev- 
el value is stored, e.g., level 1 when the activation switch 
is turned on for the first time after the stuffed toy is pur- 
chased, level 2 after 24 hours pass, and level 3 after 24 
more hours pass. 

When it receives the final recognition data from
speech recognition area 3, said response content
creation area 33 references said response content level
generation area 31 and determines response content that
corresponds to the response content level value. During
this process, response content level generation area 31
fetches the response content level that corresponds to
the time data from response content level storage area
32. For example, response content level 1 is fetched if
the current time is within the first 24 hours after the
switch was turned on for the first time, and level 2 is
fetched if the current time is between the 24th and 48th
hours.
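
The time-to-level relationship described above can be
written as a small function. This is a sketch only: the
name response_level and its parameters are assumptions, and
the optional max_level cap is added for illustration and is
not taken from the text.

```python
def response_level(hours_since_first_power_on, hours_per_level=24,
                   max_level=None):
    """Map elapsed time since first power-on to a response content level.

    Level 1 covers the first `hours_per_level` hours, level 2 the next, etc.
    """
    level = int(hours_since_first_power_on // hours_per_level) + 1
    if max_level is not None:
        # Optional cap, an assumption added for this sketch.
        level = min(level, max_level)
    return level
```

With the default of 24 hours per level, any time in the
first day gives level 1 and any time between the 24th and
48th hours gives level 2, as in the example.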

Response content creation area 33 then creates
response data possessing the response content that
corresponds to the fetched response content level,
based on the recognition data from speech recognition
area 3. For example, "Bow-wow" is returned for
recognition data "Good morning" when the response content
level (hereafter simply referred to as "level") is 1, broken
"G-o-o-d mor-ning" for level 2, "Good morning" for level
3, and "Good morning. It's a nice day, isn't it?" for a
higher level n. In this way, both the response content and
level are increased as time passes. The response data
created by said response content creation area 33 is
synthesised into speech by speech synthesis area 6 and
is output from speaker 8.
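
A lookup table is one simple way to picture how response
content could depend on the level. In the sketch below the
names RESPONSES and create_response are hypothetical; only
levels 1 to 3 are listed because the text leaves the higher
level n unspecified, and falling back to the highest
defined level is an assumption of this sketch.

```python
# Replies per recognised phrase and level (levels 1 to 3 from the text).
RESPONSES = {
    "Good morning": {
        1: "Bow-wow",
        2: "G-o-o-d mor-ning",
        3: "Good morning",
    },
}


def create_response(phrase, level, table=RESPONSES):
    """Return the reply for `phrase` at the given response content level."""
    by_level = table[phrase]
    # Use the highest defined level that does not exceed the current level.
    usable = [lvl for lvl in by_level if lvl <= level]
    if not usable:
        raise ValueError("level must be at least 1")
    return by_level[max(usable)]
```

Further phrases such as "Good night" could be added to the
table in the same way without changing the function.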

Suppose that a phrase "Good morning" issued by
a non-specific speaker is input into microphone 1. The
characteristics of this speaker's "Good morning" are
analysed by speech analysis area 2 and are input into
speech recognition area 3 as a speech data pattern.

Said speech data pattern of "Good morning" that
was input into the neural network of speech recognition
area 3 in this way is output from the output area of the
neural network as recognition data possessing a value,
instead of binary data. If the recognition data for
the phrase "Good morning" is higher than the recognition
data for other phrases, speech recognition area 3
correctly recognises the speaker's "Good morning" as
"Good morning."

The recognition data for the phrase "Good morning"
thus identified is input into response content creation
area 33. Response content creation area 33 then
determines the response content for the input recognition
data, based on the input recognition data and the content
of response content level generation area 31.

As explained above, the response level value from
said response content level generation area 31 is used
for gradually increasing the level of response content in
response to the phrase issued by the speaker, and in
this case, the level is increased as time passes based
on the time data of clock area 5. However, it is also
possible to change the level value based on the number or
types of phrases recognised, instead of the passage of
time. Alternatively, it is possible to change the level
value based on the combination of the passage of time and
the number or types of phrases recognised.

Working example 3 is characterised in that it
provides an illusion that the stuffed toy is growing up like a
living creature as time passes. In other words, the
stuffed toy can only respond with "Bow-wow" to "Good
morning" on the first day after being purchased because
the response level is only 1. However, on the second
day, it can respond with "G-o-o-d mor-ning" to "Good
morning" because the response level
is 2. Furthermore, after several days, the stuffed toy can
respond with "It's a nice day, isn't it?" to "Good morning"
because of a higher level.

Although the unit of time for increasing the response 
content by one level was set at 1 day (24 hours) in the 
above explanation, the unit is not limited to 1 day, and 
it is possible to use a longer or shorter time span for 
increasing the level. Note that it will be possible to reset
any level increase if a reset switch for resetting the level
is provided. For example, it will be possible to reset
the level back to the initial value when level 3 has been 
reached. 

Although the above explanation was provided for
the response to the phrase "Good morning," it is not
limited to "Good morning" and is naturally applicable to
upgrading of responses to other phrases such as "Good
night" and "I'm leaving." Take "Good night" for example.
The content of the response from the stuffed toy in reply
to "Good night" can be changed from "Unn-unn" (puppy
cry) in level 1 to "G-o-o-d nigh-t" in level 2.

By increasing the level of response content in this 
way, the stuffed toy dog can be made to appear to be 
changing the content of its response as it grows. The 
toy can then be made to act like a living creature by mak- 
ing it respond differently as time passes even when the 
same phrase "Good morning" is recognised. Furthermore,
the toy is not boring because it responds with different
phrases even when the speaker says the same
thing. 

Working example 3 is also useful for training the
speaker to find out the best way to speak to the toy in
order to obtain a high recognition rate while the toy's
response content level value is still low. That is, when
the speaker does not pronounce "Good morning" in a
correct way, the "Good morning" will not be easily
recognised, often resulting in a low recognition rate.
However, if the toy responds with "Bow-wow" to "Good
morning," this means that the "Good morning" was correctly
recognised. Therefore, if the speaker practises speaking
in a recognisable manner early on, the speaker learns
how to speak so as to be recognised. Consequently, the
speaker's phrases will be recognised at high rates even
when the response content level value increases,
resulting in smooth interactions.

(Working example 4) 

Next, Working example 4 of the invention will be ex- 
plained with reference to Figure 4. Note that stuffed toy 
dog 30, motion mechanism 10 for moving the mouth of 
the stuffed toy, etc. shown in Figure 1 are omitted from 
Figure 4. In Working example 4 temperature is detected 
as one of the variable data that affect the interaction, 
and the change in temperature is used for changing the 
content of the response from response content creation 
area 33 shown in Working example 3 above. Tempera- 
ture sensor 34 is provided in Figure 4, and like symbols 
are used to represent like parts as in Figure 3. When it 
receives the recognition data from speech recognition
area 3, said response content creation area 33
determines the response content for stuffed toy 30 based on
the recognition data and the temperature data from
temperature sensor 34. The specific processing details will
be explained later in the document.

In Figure 4, the speech that is input from micro- 
phone 1 is analysed by speech analysis area 2, and a 
speech data pattern matching the characteristic volume 
of the input speech is created. This speech data pattern 
is input into the input area of the neural network provided 
in speech recognition area 3, and is recognised as
speech.

Suppose that a phrase "Good morning" issued by 
a non-specific speaker is input into microphone 1. The
characteristics of this speaker's "Good morning" are analysed
by speech analysis area 2 and are input into
speech recognition area 3 as a speech data pattern. 

Said speech data pattern of "Good morning" that 
was input into the neural network of speech recognition 
area 3 in this way is output from the output area of the 
neural network as recognition data possessing a value,
instead of binary data. If the recognition data for
the phrase "Good morning" is higher than the recognition
data for other phrases, speech recognition area 3
correctly recognises the speaker's "Good morning" as 
"Good morning." 

The recognition data for the phrase "Good morning" 
thus recognised is input into response content creation 
area 33. Response content creation area 33 then deter- 
mines the response content for the input recognition da- 
ta, based on the input recognition data and the temper- 
ature data from temperature sensor 34. 

Therefore, the data content of the response to the 
recognition data that is output by speech recognition ar- 
ea 3 can be created according to the current tempera- 
ture. For example, suppose that the speaker's "Good 
morning" is correctly recognised by speech recognition 
area 3 as "Good morning." Response content creation 
area 33 then creates response data "Good morning. It's 
a bit cold, isn't it?" in reply to the recognition data "Good 
morning" if the current temperature is low. On the other 
hand, response data "Good morning. It's a bit hot, isn't 
it?" is created in reply to the same recognition data 
"Good morning" if the current temperature is higher. The 
response data created by response content creation area 33
is input into speech synthesis area 6 and drive
control area 7. The speech data input into speech synthesis
area 6 is converted into speech synthesis data, and is
output by speaker 8 embedded in the body of the stuffed
toy dog. The recognition data input into drive control
area 7 drives motion mechanism 10 (see Figure 1) according
to the corresponding pre-determined drive condition
and moves the mouth of the stuffed toy while the re- 
sponse is being issued. 

In this way, the stuffed toy dog can be made to
behave as if it sensed a change in the temperature in its
environment and responded accordingly. The toy can
then be made to act like a living creature by making it
respond differently as the surrounding temperature
changes even when the same phrase "Good morning"
is recognised. Furthermore, the toy is not boring
because it responds with different phrases even when the
speaker says the same thing.
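
The temperature-dependent reply can be sketched as below.
The text speaks only of "low" and "higher" temperatures, so
the two thresholds in degrees Celsius, the neutral reply
between them and the function name temperature_response
are all assumptions made for illustration.

```python
def temperature_response(phrase, temperature_c,
                         cold_below=10.0, hot_above=28.0):
    """Append a remark about the temperature to the plain reply.

    `cold_below` and `hot_above` are hypothetical thresholds in Celsius.
    """
    if temperature_c < cold_below:
        return f"{phrase}. It's a bit cold, isn't it?"
    if temperature_c > hot_above:
        return f"{phrase}. It's a bit hot, isn't it?"
    # Between the thresholds, return the greeting with no remark.
    return f"{phrase}."
```

Calling temperature_response("Good morning", 5.0) gives the
"a bit cold" reply from the example, while a reading of
30.0 gives the "a bit hot" reply.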

(Working example 5)

Next, Working example 5 of the invention will be
explained with reference to Figure 5. Note that stuffed toy
dog 30, motion mechanism 10 for moving the mouth of
the stuffed toy, etc. shown in Figure 1 are omitted from
Figure 5. In Working example 5, air pressure is detected
as one of the variable data that affect the interaction,
and the change in air pressure (good or bad weather) is
used for changing the content of the response from
response content creation area 33 shown in Working
example 3 above. Air pressure sensor 35 is provided in
Figure 5, and like symbols are used to represent like
parts as in Figure 3. Said response content creation
area 33 receives the recognition data from speech
recognition area 3, and determines the response content for
the stuffed toy based on the recognition data and the air
pressure data from air pressure sensor 35. The specific
processing details will be explained later in the
document.

In Figure 5, the speech that is input from
microphone 1 is analysed by speech analysis area 2, and a
speech data pattern matching the characteristic volume
of the input speech is created. This speech data pattern
is input into the input area of the neural network provided
in speech recognition area 3, and is recognised as
speech.

Suppose that a phrase "Good morning" issued by
a non-specific speaker is input into microphone 1. The
characteristics of this speaker's "Good morning" are
analysed by speech analysis area 2 and are input into
speech recognition area 3 as a speech data pattern.

Said speech data pattern of "Good morning" that
was input into the neural network of speech recognition
area 3 in this way is output from the output area of the
neural network as recognition data possessing a value,
instead of binary data. If the recognition data for
the phrase "Good morning" is higher than the recognition
data for other phrases, speech recognition area 3
correctly recognises the speaker's "Good morning" as
"Good morning."

The recognition data for the phrase "Good morning"
thus recognised is input into response content creation
area 33. Response content creation area 33 then
determines the response content for the input recognition
data, based on the input recognition data and the air
pressure data from air pressure sensor 35.

Therefore, the data content of the response to the
recognition data that is output by speech recognition
area 3 can be created according to the current air
pressure. For example, suppose that the speaker's "Good
morning" is correctly recognised by speech recognition
area 3 as "Good morning." Response content creation
area 33 then creates response data "Good morning. The
weather is going to get worse today." in reply to the
recognition data "Good morning" if the air pressure has
fallen. On the other hand, response data "Good morning.
The weather is going to get better today." is created in
reply to the recognition data "Good morning" if the air
pressure has risen. The response data created by
response content creation area 33 is input into speech
synthesis area 6 and drive control area 7. The speech
data input into speech synthesis area 6 is converted into
speech synthesis data, and is output by speaker 8 embedded
in the body of the stuffed toy dog. The recognition data
input into drive control area 7 drives motion mechanism
10 (see Figure 1) according to the corresponding
pre-determined drive condition and moves the mouth of the
stuffed toy while the response is being issued.

In this way, the stuffed toy dog can be made to
behave as if it sensed a change in the air pressure in its
environment and responded accordingly. The toy can
then be made to act like a living creature by making it
respond differently as the air pressure changes even
when the same phrase "Good morning" is recognised.
Furthermore, the toy is not boring because it responds
with different phrases even when the speaker says the
same thing.

(Working example 6)

Next, Working example 6 of the invention will be
explained with reference to Figure 6. Note that stuffed toy
dog 30, motion mechanism 10 for moving the mouth of
the stuffed toy, etc. shown in Figure 1 are omitted from
Figure 6. In Working example 6, calendar data is
detected as one of the variable data that affect the interaction,
and the change in calendar data (change in date) is used
for changing the content of the response. The
configuration in Figure 6 is different from those in Figures 4 and
5 in that calendar area 36 is provided in place of
temperature sensor 34 or air pressure sensor 35, and like
symbols are used to represent like parts as in Figures
4 or 5. Note that said calendar area 36 updates the
calendar by referencing the time data from the clock area
(not shown in the figure). Response content creation
area 33 in Working example 6 receives speech
recognition data from speech recognition area 3, and
determines the response content for the stuffed toy based on
the recognition data and the calendar data from
calendar area 36. The specific processing details will be
explained later in the document.

In Figure 6, the speech that is input from
microphone 1 is analysed by speech analysis area 2, and a
speech data pattern matching the characteristic volume
of the input speech is created. This speech data pattern
is input into the input area of the neural network provided
in speech recognition area 3 and is recognised as
speech.

Suppose that a phrase "Good morning" issued by
a non-specific speaker is input into microphone 1. The
characteristics of this speaker's "Good morning" are an- 
alysed by speech analysis area 2 and are input into 
speech recognition area 3 as a speech data pattern. 

Said speech data pattern of "Good morning" that 
was input into the neural network of speech recognition 
area 3 in this way is output from the output area of the 
neural network as recognition data possessing a value,
instead of binary data. If the recognition data for
the phrase "Good morning" is higher than the recognition
data for other phrases, speech recognition area 3
correctly recognises the speaker's "Good morning" as
"Good morning." 

The recognition data for the phrase "Good morning" 
thus recognised is input into response content creation 
area 33. Response content creation area 33 then deter- 
mines the response content for the input recognition da- 
ta based on the input recognition data and the calendar 
data (date information which can also include year data) 
from calendar area 36. 

Therefore, the data content of the response to the
recognition data that is output by speech recognition
area 3 can be created according to the current date. For
example, suppose that the speaker's "Good morning" is
correctly recognised by speech recognition area 3 as
"Good morning." Response content creation area 33
then creates response data "Good morning. Please take
me to cherry blossom viewing." in reply to the
recognition data "Good morning" if the calendar data shows
April 1. On the other hand, response data "Good
morning. Christmas is coming soon." is created in reply to the
same recognition data "Good morning" if the calendar
data shows December 23. Naturally, it is possible to
create a response that is different from the previous year if
the year data is available.

The response data created by response content 
creation area 33 is input into speech synthesis area 6 
and drive control area 7. The speech data input into syn- 
thesis area 6 is converted into speech synthesis data, 
and is output by speaker 8 embedded in the body of the 
stuffed toy dog. The recognition data input into drive 
control area 7 drives motion mechanism 10 (see Figure
1) according to the corresponding pre-determined drive
condition and moves the mouth of the stuffed toy while 
the response is being issued. 

In this way, the stuffed toy dog can be made to behave
as if it sensed a change in the date and responded
accordingly. The toy can then be made to act like a living
creature by making it respond differently as the date 
changes even when the same phrase "Good morning" 
is recognised. Furthermore, the toy is not boring be- 
cause it responds with different phrases even when the 
speaker says the same thing. 
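
The date-dependent replies of Working example 6 can be
pictured as a lookup keyed on month and day. The names
DATE_REPLIES and calendar_response are hypothetical, and
the plain reply used on all other dates is an assumption;
the two dated replies are the ones quoted in the text.

```python
import datetime

# Replies to "Good morning" keyed by (month, day), taken from the text.
DATE_REPLIES = {
    (4, 1): "Good morning. Please take me to cherry blossom viewing.",
    (12, 23): "Good morning. Christmas is coming soon.",
}


def calendar_response(today, default="Good morning."):
    """Return the reply to "Good morning" for the given calendar date."""
    # Dates with no special entry get the assumed plain reply.
    return DATE_REPLIES.get((today.month, today.day), default)
```

Because a datetime.date also carries the year, the table
could be extended to key on the year as well if replies are
to differ from one year to the next.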

Although several working examples were used for 
explaining the present invention, the invention can be 
widely applied to electronic instruments that are used 
daily, such as personal digital assistants and interactive 
games, in addition to toys. Furthermore, in the third and
subsequent working examples, speech recognition area
3 can obtain the final recognition data using weighting
coefficients that take into consideration the
appropriateness of the content of the speaker's phrase relative to
variable data such as time of day, as in Working example
1 or 2, or can obtain the final recognition data using
some other method. For example, if the final recognition
data is obtained as in Working example 1 or 2 and the
response content for this final recognition data is
processed as explained in Working examples 3 through 6,
the speaker's phrases can be successfully recognised
at high rates, and the response to the speaker's phrases
can match the prevailing condition much better.
Additionally, by using all of the response content processes
explained in Working examples 3 through 6 or in some
combinations, the response can match the prevailing
condition much better. For example, if Working example
2 is combined with Working example 3, and the
temperature sensor, the air pressure sensor, and the calendar
area explained in Working examples 4 through 6 are
added, accurate speech recognition can be performed
that takes into consideration appropriateness of the
content of the speaker's phrase relative to time of day, and
it is possible to enjoy changes in the level of the
response content from the stuffed toy as time passes.
Furthermore, interactions that take into account information
such as temperature, weather, and date become
possible, and thus an extremely sophisticated interactive
speech recognition device can be realised.

Thus, the interactive speech recognition device of
the present invention generates a weighting coefficient
that changes as variable data changes by matching the
content of each recognition target speech, and outputs
recognition data from a speech recognition means by
taking this weighting coefficient into consideration.
Therefore, even if the recognition target speeches
contain speech data patterns possessing similar input
speech patterns, said weighting coefficient can assign
higher priority to the recognition data of the input speech
than to other catalogued recognition data. As a result,
greeting phrases related to time of day, weather, date,
etc. are recognised by considering the prevailing
condition, thus significantly improving the recognition rates.

Furthermore, when time data is used as the variable
data, a weighting coefficient that changes as time data
changes is generated by matching the content of each
recognition target speech, and recognition data is output
from a speech recognition means by taking this
weighting coefficient into consideration. Therefore, the
recognition rates can be significantly improved for
time-related greeting phrases such as "Good morning" and "Good
night" that are used frequently.

Additionally, when time data is used as the variable 
data, the time at which an input speech is correctly rec- 

55 ognised by said speech recognition means is taken from 
said clock means so that the weighting coefficient for 
said speech is changed according to the time data for 
the correct recognition. Thus input speeches are recog- 
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nised based on the recognition data calculated by taking Claims 
the weighting coefficient into consideration. Therefore, 
the recognition rates can be significantly improved for 1. 
time-related greeting phrases such as "Good morning" 
and "Good night" that are used frequently. Furthermore, s 
the time at which input speech is correctly recognised 
is always detected, and a weighting coefficient is deter- 
mined based on the past recognition items of said 
speech, and thus it becomes possible to set weighting 
coefficients that match the actual usage conditions. w 

Time data and/or recognition count data for correct
recognition by said speech recognition means are
input, a response content level for changing the
response content for the input speech is generated based
on the input data, and a response content that matches
this response content level is output. Therefore, the
response content level can be changed in stages in response
to the phrase issued by the speaker. For example, when
the invention is used in a stuffed toy animal, the
increasing response content level gives the illusion that the
toy animal is growing and changing its response as it grows.
The toy can then be made to act like a living creature by
making it respond differently as time passes, even when
the same phrase, "Good morning" for example, is
recognised. Furthermore, the invention provides an excellent
effect in that the toy is not boring, because it responds
with different phrases even when the speaker says the
same thing. Additionally, the invention enables smooth
interaction: if the speaker learns to speak to the toy in
recognisable ways while the toy's response content level
value is still low, the speaker's phrases will necessarily
be recognised at higher rates once the response content
level value increases.
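
As a rough illustration of a response content level that rises in stages, the following sketch derives a level from the elapsed days of use and the count of correct recognitions. The thresholds in `STAGES`, the rule that either input alone can trigger a stage, and the example phrases in `RESPONSES` are hypothetical and are not taken from the description.

```python
# Hypothetical thresholds: (days in use, correct recognitions) that
# raise the response content level from 1 to 2 and from 2 to 3.
STAGES = [(7, 50), (30, 200)]

# Hypothetical response table: phrase -> response for each level.
RESPONSES = {
    "good morning": {
        1: "Morning.",
        2: "Good morning!",
        3: "Good morning! Did you sleep well?",
    },
}


def response_level(days_in_use, correct_count):
    """Return the response content level (1 = lowest)."""
    level = 1
    for min_days, min_count in STAGES:
        # Assumption: time data OR recognition count data can raise the level.
        if days_in_use >= min_days or correct_count >= min_count:
            level += 1
    return level


def respond(phrase, days_in_use, correct_count):
    """Select the response that matches the current response content level."""
    return RESPONSES[phrase][response_level(days_in_use, correct_count)]
```

The same recognised phrase therefore yields a different response as the device is used longer or recognises more speech correctly.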

Additionally, variable data that affects the response
content is detected, and response content that takes
this variable data into consideration is output.
Therefore, sophisticated interactions become possible
that correspond to various situational changes.

The temperature of the surrounding environment
may be measured as said variable data, and the
response content is output based on this temperature
data. Therefore, sophisticated interactions become
possible that tailor the response to the prevailing
temperature.

The air pressure of the surrounding environment
may be measured as said variable data, and the
response content is output based on this air pressure data.
Therefore, sophisticated interactions become possible
that tailor the response to the prevailing weather
condition.
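
A minimal sketch of how temperature and air pressure data might be taken into consideration when the response content is created is given below. The function name `tailor_response`, the thresholds of 5 and 30 degrees Celsius, the use of a pressure change of -3 hPa as a sign of worsening weather, and the appended phrases are all assumptions for illustration only.

```python
def tailor_response(base, temperature_c=None, pressure_change_hpa=None):
    """Append remarks to a base response according to sensor data.

    temperature_c: measured ambient temperature, or None if unavailable.
    pressure_change_hpa: change in air pressure over a recent period
        (negative = falling), or None if unavailable.
    """
    parts = [base]
    if temperature_c is not None:
        if temperature_c <= 5:        # assumed "cold" threshold
            parts.append("It is cold today.")
        elif temperature_c >= 30:     # assumed "hot" threshold
            parts.append("It is hot today.")
    # Assumption: a marked pressure drop suggests worsening weather.
    if pressure_change_hpa is not None and pressure_change_hpa <= -3:
        parts.append("The weather may turn bad.")
    return " ".join(parts)
```

Sensor readings that are unavailable are simply skipped, so the base response is still output unchanged.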

Finally, calendar data may be used as said variable
data, and the response content is output based on this
calendar data. Therefore, sophisticated interactions
become possible that tailor the response to the calendar
data.



An interactive speech recognition device for recog- 
nising and responding to input speech comprising: 

a speech analysing means (2) for analysing in- 
put speech by comparing it to pre-registered 
speech patterns and for creating a speech data 
pattern: 

a speech recognition means (3) for recognising 
the input speech by analysing the speech data 
pattern and deriving recognition data: 
a speech output means (6-8) for outputting a 
response to said input speech using said rec- 
ognition data; and characterised by 
a coefficient setting means (4) for generating 
weighted coefficients for each of the pre-regis- 
tered speech patterns and providing said 
speech recognition means with said coeffi- 
cients thereby enabling the recognition data ac- 
curacy to be improved. 



2. An interactive speech recognition device as
claimed in Claim 1, further comprising a variable
data detection means.

3. An interactive speech recognition device according
to Claim 2, in which said variable data detection
means comprises a timing means for detecting time
data, and in which said coefficient setting means
generates a weighting coefficient that corresponds to
the time of day for each of the pre-registered speech
patterns.

4. An interactive speech recognition device according
to Claim 3, in which the time at which an input
speech is correctly recognised by said speech
recognition means is fetched from said timing means,
the weighting coefficient for the recognition data is
given the largest value if the input speech occurs at
a time at which it was correctly recognised most
frequently in the past, and a smaller weighting
coefficient is given as the time deviates from this peak
time, based on the time data for the correct
recognition.

5. An interactive speech recognition device according
to Claim 4, in which said variable data detection
means comprises a temperature sensor that measures
the temperature of the usage environment and
outputs the temperature data, and said response
content creation means outputs the response content
data by taking said temperature data into
consideration.

6. An interactive speech recognition device according
to Claim 4, in which said variable data detection
means comprises an air pressure sensor that measures
the air pressure of the usage environment and
outputs the air pressure data, and said response
content creation means outputs the response content
data by taking said air pressure data into
consideration.

7. An interactive speech recognition device according
to Claim 4, in which said variable data detection
means comprises a calendar detection means that
detects calendar data and outputs the calendar data,
and said response content creation means outputs
the response content data by taking said calendar
data into consideration.

8. An interactive speech recognition device as
claimed in any one of Claims 5 to 7, in which said
variable data detection means comprises two or
more of a temperature sensor, an air pressure sensor
and a calendar detection means.